NSF PAR Search | NSF Public Access Repository

Provable Policy Gradient for Robust Average-Reward MDPs Beyond Rectangularity

Wang, Qiuhao; Zha, Yuqi; Ho, Chin Pang; Petrik, Marek (July 2025, International Conference on Machine Learning)

Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes a Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we establish the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
Free, publicly-accessible full text available July 18, 2026
Risk-averse Total-reward MDPs with ERM and EVaR

https://doi.org/10.1609/aaai.v39i19.34275

Su, Xihong; Petrik, Marek; Grand-Clément, Julien (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse total reward criterion, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. Compared with prior work, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
Free, publicly-accessible full text available April 11, 2026
Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis

Hau, Jia Lin; Delage, Erick; Derman, Esther; Ghavamzadeh, Mohammad; Petrik, Marek (May 2025, International Conference on Artificial Intelligence and Statistics)

Free, publicly-accessible full text available May 5, 2026
Bayesian Regret Minimization in Offline Bandits

Petrik, Marek; Tennenholtz, Guy; Ghavamzadeh, Mohammad (July 2024, Proceedings of the International Conference on Machine Learning)

Full Text Available
Non-adaptive Online Finetuning for Offline Reinforcement Learning

Huang, Audrey; Ghavamzadeh, Mohammad; Jiang, Nan; Petrik, Marek (August 2024, RL Conference Proceedings)

Full Text Available
ROIL: Robust Offline Imitation Learning without Trajectories

Doko, Gersi; Yang, Guang; Brown, Daniel S; Petrik, Marek (August 2024, The Proceeding of the RL Conference)

Full Text Available
ROIL: Robust Offline Imitation Learning without Trajectories

Doko, Gersi; Yang, Guang; Brown, Daniel S; Petrik, Marek (August 2024, Rare royalty magazine)

We study the problem of imitation learning via inverse reinforcement learning where the agent attempts to learn an expert's policy from a dataset of collected state-action tuples. We derive a new Robust model-based Offline Imitation Learning method (ROIL) that mitigates covariate shift by avoiding estimating the expert's occupancy frequency. Frequently in offline settings, there is insufficient data to reliably estimate the expert's occupancy frequency, and this leads to models that do not generalize well. Our proposed approach, ROIL, is a method that is guaranteed to recover the expert's occupancy frequency and is efficiently solvable as an LP. We demonstrate ROIL's ability to achieve minimal regret in large environments under covariate shift, such as when the state visitation frequency of the demonstrations does not come from the expert.
Full Text Available
On dynamic programming decompositions of static risk measures in Markov decision processes

Hau, Jia Lin; Delage, Erick; Ghavamzadeh, Mohammad; Petrik, Marek (May 2024, NIPS '23: Proceedings of the 37th International Conference on Neural Information Processing Systems, December)

Full Text Available
Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor

Grand-Clément, Julien; Petrik, Marek (December 2023, Advances in Neural Information Processing Systems)

Full Text Available
Percentile Criterion Optimization in Offline Reinforcement Learning

Cousins, Cyrus; Lobo, Elita; Petrik, Marek; Zick, Yair (December 2023, The Advances in Neural Information Processing Systems)

Full Text Available
